import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
pd.set_option('display.max_columns', 40)
# load the two csv files as dataframe
price_demand = pd.read_csv('price_demand_data.csv')
weather = pd.read_csv('weather_data.csv')
price_demand.head()
| REGION | SETTLEMENTDATE | TOTALDEMAND | PRICECATEGORY | |
|---|---|---|---|---|
| 0 | VIC1 | 1/01/2021 0:30 | 4179.21 | LOW |
| 1 | VIC1 | 1/01/2021 1:00 | 4047.76 | LOW |
| 2 | VIC1 | 1/01/2021 1:30 | 3934.70 | LOW |
| 3 | VIC1 | 1/01/2021 2:00 | 3766.45 | LOW |
| 4 | VIC1 | 1/01/2021 2:30 | 3590.37 | LOW |
from IPython.display import display, Markdown
#check dimension
rows, columns = price_demand.shape
data = {'Rows': [rows], 'Columns': [columns]}
dimensions_table = pd.DataFrame(data)
display(Markdown("#### Dimensions of price_demand"))
display(dimensions_table)
| Rows | Columns | |
|---|---|---|
| 0 | 11664 | 4 |
# check variable types
dtypes_pd = pd.DataFrame(price_demand.dtypes, columns=['Type'])
display(Markdown("#### Data types of price_demand"))
display(dtypes_pd)
| Type | |
|---|---|
| REGION | object |
| SETTLEMENTDATE | object |
| TOTALDEMAND | float64 |
| PRICECATEGORY | object |
# statistical summary
description = price_demand.describe()
caption = "#### Summary of 'TOTALDEMAND'"
display(Markdown(caption))
display(description)
| TOTALDEMAND | |
|---|---|
| count | 11664.000000 |
| mean | 4925.798454 |
| std | 876.407490 |
| min | 2708.530000 |
| 25% | 4255.500000 |
| 50% | 4803.755000 |
| 75% | 5477.337500 |
| max | 8196.830000 |
#check missing values
price_demand_missing = price_demand.isna().sum()
missingv = pd.DataFrame(price_demand_missing, columns=['Missing Values']).reset_index().rename(columns={'index': 'Column'})
display(Markdown("#### Missing Values in price_demand"))
display(missingv)
| Column | Missing Values | |
|---|---|---|
| 0 | REGION | 0 |
| 1 | SETTLEMENTDATE | 0 |
| 2 | TOTALDEMAND | 0 |
| 3 | PRICECATEGORY | 0 |
The 'price_demand' dataset comprises energy price and demand figures for the state of Victoria, Australia, during the period between January and August 2021. This information has been sourced from the Australian Energy Market Operator (AEMO) and provides valuable insights into the region's energy market dynamics.
The dataset contains 11,664 observations and 4 variables, captured at 30-minute intervals from 01/01/2021 to 01/09/2021.
1 REGION: the state identifier (categorical). 2 SETTLEMENTDATE: the date and time of each observation, stored as a string at this stage. 3 TOTALDEMAND: the energy demand per 30-minute interval (numerical). 4 PRICECATEGORY: a categorical description of the energy price level.
The 'price_demand' dataset is complete and contains no missing values, ensuring a high level of data integrity and reliability for analysis purposes.
weather.head()
| Date | Minimum temperature (°C) | Maximum temperature (°C) | Rainfall (mm) | Evaporation (mm) | Sunshine (hours) | Direction of maximum wind gust | Speed of maximum wind gust (km/h) | Time of maximum wind gust | 9am Temperature (°C) | 9am relative humidity (%) | 9am cloud amount (oktas) | 9am wind direction | 9am wind speed (km/h) | 9am MSL pressure (hPa) | 3pm Temperature (°C) | 3pm relative humidity (%) | 3pm cloud amount (oktas) | 3pm wind direction | 3pm wind speed (km/h) | 3pm MSL pressure (hPa) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/01/2021 | 15.6 | 29.9 | 0.0 | 2.8 | 9.3 | NNE | 31.0 | 13:14 | 19.2 | 77.0 | 6 | N | 2 | 1018.8 | 28.1 | 43 | 5.0 | E | 13 | 1015.3 |
| 1 | 2/01/2021 | 18.4 | 29.0 | 0.0 | 9.4 | 1.3 | NNW | 30.0 | 8:22 | 23.3 | 52.0 | 7 | NNW | 17 | 1013.3 | 28.7 | 38 | 7.0 | SW | 4 | 1008.5 |
| 2 | 3/01/2021 | 17.0 | 26.2 | 12.6 | 4.8 | 7.1 | WSW | 33.0 | 17:55 | 18.3 | 100.0 | 8 | WSW | 4 | 1007.7 | 23.5 | 59 | 4.0 | SSW | 2 | 1005.2 |
| 3 | 4/01/2021 | 16.0 | 18.6 | 2.6 | 3.8 | 0.0 | SSE | 41.0 | 16:03 | 16.2 | 98.0 | 8 | SSE | 11 | 1010.0 | 18.2 | 82 | 8.0 | SSW | 17 | 1011.0 |
| 4 | 5/01/2021 | 15.9 | 19.1 | 11.2 | 1.0 | 0.0 | SSE | 35.0 | 11:02 | 17.2 | 96.0 | 8 | SSE | 13 | 1012.5 | 18.2 | 82 | 8.0 | SSE | 19 | 1013.3 |
#check dimension
rows, columns = weather.shape
data1 = {'Rows': [rows], 'Columns': [columns]}
dimensions_table1 = pd.DataFrame(data1)
display(Markdown("#### Dimensions of weather"))
display(dimensions_table1)
| Rows | Columns | |
|---|---|---|
| 0 | 243 | 21 |
# statistical summary
weather.describe()
| Minimum temperature (°C) | Maximum temperature (°C) | Rainfall (mm) | Evaporation (mm) | Sunshine (hours) | Speed of maximum wind gust (km/h) | 9am Temperature (°C) | 9am relative humidity (%) | 9am cloud amount (oktas) | 9am MSL pressure (hPa) | 3pm Temperature (°C) | 3pm relative humidity (%) | 3pm cloud amount (oktas) | 3pm MSL pressure (hPa) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 242.000000 | 242.000000 | 241.000000 | 243.000000 | 243.000000 | 240.000000 | 242.000000 | 242.000000 | 243.000000 | 241.000000 | 243.000000 | 243.000000 | 242.000000 | 242.000000 |
| mean | 11.050826 | 19.445868 | 1.576763 | 3.902469 | 5.349383 | 34.412500 | 13.720661 | 74.454545 | 5.164609 | 1017.740664 | 18.040329 | 56.930041 | 5.301653 | 1015.824793 |
| std | 3.870242 | 5.354085 | 4.498754 | 2.702141 | 3.604902 | 10.909319 | 4.306618 | 14.177593 | 2.562778 | 7.683402 | 4.963547 | 14.017376 | 2.392051 | 7.435859 |
| min | 1.700000 | 10.600000 | 0.000000 | 0.000000 | 0.000000 | 15.000000 | 3.000000 | 25.000000 | 0.000000 | 989.700000 | 8.600000 | 21.000000 | 0.000000 | 989.000000 |
| 25% | 8.100000 | 15.500000 | 0.000000 | 1.900000 | 2.150000 | 28.000000 | 10.925000 | 65.000000 | 3.000000 | 1012.800000 | 14.400000 | 48.000000 | 3.000000 | 1011.000000 |
| 50% | 10.900000 | 18.300000 | 0.000000 | 3.200000 | 4.900000 | 33.000000 | 13.400000 | 75.000000 | 7.000000 | 1018.100000 | 17.100000 | 56.000000 | 7.000000 | 1015.750000 |
| 75% | 13.800000 | 21.800000 | 0.600000 | 5.600000 | 8.350000 | 41.000000 | 16.400000 | 84.000000 | 7.000000 | 1023.700000 | 20.150000 | 66.000000 | 7.000000 | 1021.600000 |
| max | 22.200000 | 39.200000 | 43.200000 | 13.800000 | 13.100000 | 67.000000 | 30.900000 | 100.000000 | 8.000000 | 1034.200000 | 35.200000 | 98.000000 | 8.000000 | 1032.400000 |
#check data type
dtypes_w = pd.DataFrame(weather.dtypes, columns=['Type'])
dtypes_w1=dtypes_w.T
display(Markdown("#### Data types of weather"))
display(dtypes_w1)
| Date | Minimum temperature (°C) | Maximum temperature (°C) | Rainfall (mm) | Evaporation (mm) | Sunshine (hours) | Direction of maximum wind gust | Speed of maximum wind gust (km/h) | Time of maximum wind gust | 9am Temperature (°C) | 9am relative humidity (%) | 9am cloud amount (oktas) | 9am wind direction | 9am wind speed (km/h) | 9am MSL pressure (hPa) | 3pm Temperature (°C) | 3pm relative humidity (%) | 3pm cloud amount (oktas) | 3pm wind direction | 3pm wind speed (km/h) | 3pm MSL pressure (hPa) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Type | object | float64 | float64 | float64 | float64 | float64 | object | float64 | object | float64 | float64 | int64 | object | object | float64 | float64 | int64 | float64 | object | object | float64 |
#check missing values
weather_missing = weather.isna().sum()
missingv1 = pd.DataFrame(weather_missing, columns=['Missing Values']).reset_index().rename(columns={'index': 'Column'})
display(Markdown("#### Missing Values in weather"))
display(missingv1)
| Column | Missing Values | |
|---|---|---|
| 0 | Date | 0 |
| 1 | Minimum temperature (°C) | 1 |
| 2 | Maximum temperature (°C) | 1 |
| 3 | Rainfall (mm) | 2 |
| 4 | Evaporation (mm) | 0 |
| 5 | Sunshine (hours) | 0 |
| 6 | Direction of maximum wind gust | 3 |
| 7 | Speed of maximum wind gust (km/h) | 3 |
| 8 | Time of maximum wind gust | 3 |
| 9 | 9am Temperature (°C) | 1 |
| 10 | 9am relative humidity (%) | 1 |
| 11 | 9am cloud amount (oktas) | 0 |
| 12 | 9am wind direction | 1 |
| 13 | 9am wind speed (km/h) | 1 |
| 14 | 9am MSL pressure (hPa) | 2 |
| 15 | 3pm Temperature (°C) | 0 |
| 16 | 3pm relative humidity (%) | 0 |
| 17 | 3pm cloud amount (oktas) | 1 |
| 18 | 3pm wind direction | 0 |
| 19 | 3pm wind speed (km/h) | 0 |
| 20 | 3pm MSL pressure (hPa) | 1 |
# display details of missing values
total_cells_missing = weather.isnull().sum().sum()
total_rows_missing = weather.isnull().any(axis=1).sum()
total_cols_missing = weather.isnull().any(axis=0).sum()
data = {"Total cells with missing values": [total_cells_missing],
"Total rows with missing values": [total_rows_missing],
"Total columns with missing values": [total_cols_missing],}
missing_data_summary = pd.DataFrame(data)
display(Markdown("#### Missing data summary"))
display(missing_data_summary)
| Total cells with missing values | Total rows with missing values | Total columns with missing values | |
|---|---|---|---|
| 0 | 21 | 6 | 13 |
The 'weather' dataset comprises key weather indicators for the city of Melbourne, Australia, during the period between January and August 2021. This information has been sourced from the Bureau of Meteorology and provides valuable insights into the region's weather patterns and climate.
The dataset contains 243 observations and 21 variables, captured daily from 01/01/2021 to 31/08/2021.
Here is a summary of the variable types:
float64: 12 variables, int64: 2 variables, object (string): 7 variables. Note that 'Date' is read in as a string and is converted to datetime64 during preparation.
The 'weather' dataset contains some missing values, with a total of 21 cells having missing values across 6 rows and 13 columns.
# split the SETTLEMENTDATE into two columns, one is for the date and the other is for the time
price_demand[['DATE','TIME']] = price_demand['SETTLEMENTDATE'].str.split(' ', expand=True)
price_demand.head()
#cast the new date column to datetime format
price_demand['DATE'] = pd.to_datetime(price_demand['DATE'], format='%d/%m/%Y')
date_counts = price_demand['DATE'].value_counts()
#display modified data
price_demand_1 = price_demand.copy()
price_demand_1.head()
| REGION | SETTLEMENTDATE | TOTALDEMAND | PRICECATEGORY | DATE | TIME | |
|---|---|---|---|---|---|---|
| 0 | VIC1 | 1/01/2021 0:30 | 4179.21 | LOW | 2021-01-01 | 0:30 |
| 1 | VIC1 | 1/01/2021 1:00 | 4047.76 | LOW | 2021-01-01 | 1:00 |
| 2 | VIC1 | 1/01/2021 1:30 | 3934.70 | LOW | 2021-01-01 | 1:30 |
| 3 | VIC1 | 1/01/2021 2:00 | 3766.45 | LOW | 2021-01-01 | 2:00 |
| 4 | VIC1 | 1/01/2021 2:30 | 3590.37 | LOW | 2021-01-01 | 2:30 |
# drop variables
price_demand_2 = price_demand_1.drop(columns=['SETTLEMENTDATE', 'REGION', 'TIME'])
# Print the modified data
price_demand_2.head()
| TOTALDEMAND | PRICECATEGORY | DATE | |
|---|---|---|---|
| 0 | 4179.21 | LOW | 2021-01-01 |
| 1 | 4047.76 | LOW | 2021-01-01 |
| 2 | 3934.70 | LOW | 2021-01-01 |
| 3 | 3766.45 | LOW | 2021-01-01 |
| 4 | 3590.37 | LOW | 2021-01-01 |
#create new variables
grouped_pd = price_demand_1.groupby(['DATE', 'PRICECATEGORY']).size().unstack(fill_value=0)
grouped_pd.columns = [f"number of {col.lower()}" for col in grouped_pd.columns]
# reset the index to merge the data back into the original DataFrame
grouped_pd.reset_index(inplace=True)
# merge the grouped data back into the original DataFrame
price_demand_3 = price_demand_2.merge(grouped_pd, on='DATE')
price_demand_3.head()
| TOTALDEMAND | PRICECATEGORY | DATE | number of extreme | number of high | number of low | number of medium | |
|---|---|---|---|---|---|---|---|
| 0 | 4179.21 | LOW | 2021-01-01 | 0 | 0 | 47 | 0 |
| 1 | 4047.76 | LOW | 2021-01-01 | 0 | 0 | 47 | 0 |
| 2 | 3934.70 | LOW | 2021-01-01 | 0 | 0 | 47 | 0 |
| 3 | 3766.45 | LOW | 2021-01-01 | 0 | 0 | 47 | 0 |
| 4 | 3590.37 | LOW | 2021-01-01 | 0 | 0 | 47 | 0 |
# Group by DATE and compute each day's total demand
# (select 'TOTALDEMAND' before summing to avoid summing non-numeric columns)
daily_sum = price_demand_3.groupby('DATE', as_index=False)['TOTALDEMAND'].sum()
daily_sum.columns = ['DATE', 'Daily use']
price_demand_4 = price_demand_3.merge(daily_sum, on='DATE')
price_demand_4.head()
| TOTALDEMAND | PRICECATEGORY | DATE | number of extreme | number of high | number of low | number of medium | Daily use | |
|---|---|---|---|---|---|---|---|---|
| 0 | 4179.21 | LOW | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 |
| 1 | 4047.76 | LOW | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 |
| 2 | 3934.70 | LOW | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 |
| 3 | 3766.45 | LOW | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 |
| 4 | 3590.37 | LOW | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 |
#drop variables
price_demand_4 = price_demand_4.drop(columns=['PRICECATEGORY', 'TOTALDEMAND'])
# Print the modified data
price_demand_4.head()
| DATE | number of extreme | number of high | number of low | number of medium | Daily use | |
|---|---|---|---|---|---|---|
| 0 | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 |
| 1 | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 |
| 2 | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 |
| 3 | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 |
| 4 | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 |
#drop duplicated observations
price_demand_5 = price_demand_4.drop_duplicates(subset=["DATE"])
# Print the DataFrame with unique rows
price_demand_5.tail()
| DATE | number of extreme | number of high | number of low | number of medium | Daily use | |
|---|---|---|---|---|---|---|
| 11471 | 2021-08-28 | 0 | 0 | 19 | 29 | 209104.93 |
| 11519 | 2021-08-29 | 2 | 3 | 19 | 24 | 224449.35 |
| 11567 | 2021-08-30 | 0 | 1 | 26 | 21 | 232158.40 |
| 11615 | 2021-08-31 | 0 | 0 | 31 | 17 | 226540.29 |
| 11663 | 2021-09-01 | 0 | 0 | 0 | 1 | 4811.27 |
price_demand_6 = price_demand_5.copy()
# create new column 'month'
price_demand_6['month'] = price_demand_6['DATE'].dt.month
price_demand_6.head()
| DATE | number of extreme | number of high | number of low | number of medium | Daily use | month | |
|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 | 1 |
| 47 | 2021-01-02 | 0 | 0 | 48 | 0 | 197990.13 | 1 |
| 95 | 2021-01-03 | 0 | 0 | 48 | 0 | 188742.96 | 1 |
| 143 | 2021-01-04 | 0 | 0 | 48 | 0 | 199281.07 | 1 |
| 191 | 2021-01-05 | 0 | 0 | 48 | 0 | 207680.91 | 1 |
def computing_day(day):
if day.weekday() < 5:
return 'weekday'
else:
return 'weekend'
#create new column 'type of day'
price_demand_7 = price_demand_6.copy()
price_demand_7['type of day'] =price_demand_7['DATE'].apply(computing_day)
price_demand_7.head()
| DATE | number of extreme | number of high | number of low | number of medium | Daily use | month | type of day | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 | 1 | weekday |
| 47 | 2021-01-02 | 0 | 0 | 48 | 0 | 197990.13 | 1 | weekend |
| 95 | 2021-01-03 | 0 | 0 | 48 | 0 | 188742.96 | 1 | weekend |
| 143 | 2021-01-04 | 0 | 0 | 48 | 0 | 199281.07 | 1 | weekday |
| 191 | 2021-01-05 | 0 | 0 | 48 | 0 | 207680.91 | 1 | weekday |
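As an aside, the row-wise `apply` above can be replaced with a vectorized expression; a minimal sketch on hypothetical dates, using the fact that `dt.dayofweek` runs 0–4 for Monday–Friday:

```python
import numpy as np
import pandas as pd

# Hypothetical dates: Fri 1 Jan 2021 through Mon 4 Jan 2021
df = pd.DataFrame({"DATE": pd.to_datetime(
    ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"])})

# dayofweek: Monday=0 ... Sunday=6, so >= 5 marks Saturday/Sunday
df["type of day"] = np.where(df["DATE"].dt.dayofweek >= 5, "weekend", "weekday")
print(df["type of day"].tolist())  # ['weekday', 'weekend', 'weekend', 'weekday']
```

The vectorized form avoids a Python-level call per row, which matters little at 242 rows but scales better.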
#create new column 'weekday'
price_demand_8 = price_demand_7.copy()
dummy_vars = pd.get_dummies(price_demand_8['type of day'], prefix='', prefix_sep='')
# Concatenate the original DataFrame and the dummy variables
price_demand_8 = pd.concat([price_demand_8, dummy_vars], axis=1)
# Drop the original 'type of day' column and the redundant 'weekend' dummy
price_demand_8 = price_demand_8.drop('type of day', axis=1)
price_demand_8 = price_demand_8.drop('weekend', axis=1)
price_demand_8.head()
| DATE | number of extreme | number of high | number of low | number of medium | Daily use | month | weekday | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 | 0 | 0 | 47 | 0 | 185853.37 | 1 | 1 |
| 47 | 2021-01-02 | 0 | 0 | 48 | 0 | 197990.13 | 1 | 0 |
| 95 | 2021-01-03 | 0 | 0 | 48 | 0 | 188742.96 | 1 | 0 |
| 143 | 2021-01-04 | 0 | 0 | 48 | 0 | 199281.07 | 1 | 1 |
| 191 | 2021-01-05 | 0 | 0 | 48 | 0 | 207680.91 | 1 | 1 |
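Dropping the complementary 'weekend' column avoids the dummy-variable trap (two perfectly collinear 0/1 columns). The same effect is available in one call with `drop_first=True`, though it drops the alphabetically first category ('weekday'), so the kept column would be 'weekend' rather than 'weekday'; a small sketch on hypothetical values:

```python
import pandas as pd

s = pd.Series(["weekday", "weekend", "weekday"], name="type of day")
# drop_first=True removes the first (alphabetical) category, keeping a
# single binary column and avoiding perfectly collinear dummies
dummies = pd.get_dummies(s, drop_first=True)
print(dummies.columns.tolist())  # ['weekend']
```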
#remove observations for the boundary dates with incomplete data
price_demand_8 = price_demand_8[price_demand_8['DATE'] != '2021-09-01']
price_demand_8 = price_demand_8[price_demand_8['DATE'] != '2021-01-01']
#create new columns related to season
price_demand_9 = price_demand_8.copy()
price_demand_9["SEASON"] = price_demand_9["DATE"].dt.month.apply(lambda x: "Summer" if x in [1, 2, 12] else "Autumn" if x in [3, 4, 5] else "Winter")
# Convert the season columns to 1s and 0s
season_cols = [ "Summer", "Autumn", "Winter"]
for col in season_cols:
price_demand_9[col] = price_demand_9["SEASON"].apply(lambda x: 1 if x == col else 0)
price_demand_9.drop('SEASON',inplace=True,axis=1)
price_demand_9.head(10)
| DATE | number of extreme | number of high | number of low | number of medium | Daily use | month | weekday | Summer | Autumn | Winter | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | 2021-01-02 | 0 | 0 | 48 | 0 | 197990.13 | 1 | 0 | 1 | 0 | 0 |
| 95 | 2021-01-03 | 0 | 0 | 48 | 0 | 188742.96 | 1 | 0 | 1 | 0 | 0 |
| 143 | 2021-01-04 | 0 | 0 | 48 | 0 | 199281.07 | 1 | 1 | 1 | 0 | 0 |
| 191 | 2021-01-05 | 0 | 0 | 48 | 0 | 207680.91 | 1 | 1 | 1 | 0 | 0 |
| 239 | 2021-01-06 | 0 | 0 | 48 | 0 | 201497.24 | 1 | 1 | 1 | 0 | 0 |
| 287 | 2021-01-07 | 0 | 0 | 48 | 0 | 201345.90 | 1 | 1 | 1 | 0 | 0 |
| 335 | 2021-01-08 | 0 | 0 | 46 | 2 | 207526.40 | 1 | 1 | 1 | 0 | 0 |
| 383 | 2021-01-09 | 0 | 0 | 42 | 6 | 212630.42 | 1 | 0 | 1 | 0 | 0 |
| 431 | 2021-01-10 | 0 | 0 | 34 | 14 | 230588.61 | 1 | 0 | 1 | 0 | 0 |
| 479 | 2021-01-11 | 0 | 0 | 37 | 11 | 290620.38 | 1 | 1 | 1 | 0 | 0 |
#check dimension
rows, columns = price_demand_9.shape
data = {'Rows': [rows], 'Columns': [columns]}
dimensions_table = pd.DataFrame(data)
display(Markdown("#### Dimensions of updated price_demand "))
display(dimensions_table)
| Rows | Columns | |
|---|---|---|
| 0 | 242 | 11 |
1 Split 'SETTLEMENTDATE' into 'DATE' and 'TIME' columns.
2 Converted 'DATE' to datetime format.
3 Removed the 'SETTLEMENTDATE' and 'TIME' columns, as our aim was to restructure the data into a daily format.
4 Removed the 'REGION' variable, which is constant ('VIC1' throughout) and therefore irrelevant to our analysis.
5 Added daily count columns for the price categories: 'number of extreme', 'number of high', 'number of low', and 'number of medium'. This transformation enables more efficient analysis and interpretation.
6 Calculated a 'Daily use' column that sums the 'TOTALDEMAND' values for each day.
7 Removed the 'PRICECATEGORY' and 'TOTALDEMAND' columns, completing the conversion to a daily format.
8 Added a 'month' variable extracted from the 'DATE' column.
9 Added a 'type of day' variable ('weekday' or 'weekend') derived from the 'DATE' column.
10 Added a 'weekday' dummy variable and removed the 'type of day' column.
11 Removed rows with incomplete data for 01/01/2021 and 01/09/2021.
12 Added seasonal dummy variables 'Summer', 'Autumn', and 'Winter' derived from the 'DATE' column (spring months do not appear in the January–August window).
Now the dataset has 242 observations and 11 variables.
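Step 5's per-day category counts reduce to a groupby/unstack; a minimal sketch on hypothetical toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "DATE": ["2021-01-01"] * 3 + ["2021-01-02"] * 2,
    "PRICECATEGORY": ["LOW", "LOW", "HIGH", "MEDIUM", "LOW"],
})
# one row per DATE, one count column per observed price category;
# fill_value=0 covers date/category pairs that never occur
counts = toy.groupby(["DATE", "PRICECATEGORY"]).size().unstack(fill_value=0)
counts.columns = [f"number of {c.lower()}" for c in counts.columns]
print(counts.loc["2021-01-01", "number of low"])  # 2
```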
weather_1 = weather.copy()
# convert the 'Date' column to datetime format
weather_1['Date'] = pd.to_datetime(weather_1['Date'], format='%d/%m/%Y')
# remove the boundary dates to match the price_demand dataset
weather_1 = weather_1[weather_1['Date'] != '2021-09-01']
weather_1 = weather_1[weather_1['Date'] != '2021-01-01']
1 Converted the 'Date' column to datetime64 format.
2 Removed rows dated 01/01/2021 and 01/09/2021 to match the price_demand dataset.
# merge data
merged = pd.merge(price_demand_9, weather_1, left_on='DATE', right_on='Date', how='inner')
display(Markdown("#### Merged Dataframe with Price_Demand and weather"))
# Displaying the table
display(merged.head(2))
| DATE | number of extreme | number of high | number of low | number of medium | Daily use | month | weekday | Summer | Autumn | Winter | Date | Minimum temperature (°C) | Maximum temperature (°C) | Rainfall (mm) | Evaporation (mm) | Sunshine (hours) | Direction of maximum wind gust | Speed of maximum wind gust (km/h) | Time of maximum wind gust | 9am Temperature (°C) | 9am relative humidity (%) | 9am cloud amount (oktas) | 9am wind direction | 9am wind speed (km/h) | 9am MSL pressure (hPa) | 3pm Temperature (°C) | 3pm relative humidity (%) | 3pm cloud amount (oktas) | 3pm wind direction | 3pm wind speed (km/h) | 3pm MSL pressure (hPa) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-02 | 0 | 0 | 48 | 0 | 197990.13 | 1 | 0 | 1 | 0 | 0 | 2021-01-02 | 18.4 | 29.0 | 0.0 | 9.4 | 1.3 | NNW | 30.0 | 8:22 | 23.3 | 52.0 | 7 | NNW | 17 | 1013.3 | 28.7 | 38 | 7.0 | SW | 4 | 1008.5 |
| 1 | 2021-01-03 | 0 | 0 | 48 | 0 | 188742.96 | 1 | 0 | 1 | 0 | 0 | 2021-01-03 | 17.0 | 26.2 | 12.6 | 4.8 | 7.1 | WSW | 33.0 | 17:55 | 18.3 | 100.0 | 8 | WSW | 4 | 1007.7 | 23.5 | 59 | 4.0 | SSW | 2 | 1005.2 |
# drop column 'Date'
merged.drop('Date', axis=1, inplace=True)
#check dimension
rows, columns = merged.shape
data2 = {'Rows': [rows], 'Columns': [columns]}
dimensions_table2 = pd.DataFrame(data2)
display(Markdown("#### Dimensions of 'merged'"))
display(dimensions_table2)
| Rows | Columns | |
|---|---|---|
| 0 | 242 | 31 |
# check missing values
merged_missing = merged.isna().sum()
missingv2 = pd.DataFrame(merged_missing, columns=['Missing Values']).reset_index().rename(columns={'index': 'Column'})
display(Markdown("#### Missing Values in 'merged' "))
display(missingv2)
| Column | Missing Values | |
|---|---|---|
| 0 | DATE | 0 |
| 1 | number of extreme | 0 |
| 2 | number of high | 0 |
| 3 | number of low | 0 |
| 4 | number of medium | 0 |
| 5 | Daily use | 0 |
| 6 | month | 0 |
| 7 | weekday | 0 |
| 8 | Summer | 0 |
| 9 | Autumn | 0 |
| 10 | Winter | 0 |
| 11 | Minimum temperature (°C) | 1 |
| 12 | Maximum temperature (°C) | 1 |
| 13 | Rainfall (mm) | 2 |
| 14 | Evaporation (mm) | 0 |
| 15 | Sunshine (hours) | 0 |
| 16 | Direction of maximum wind gust | 3 |
| 17 | Speed of maximum wind gust (km/h) | 3 |
| 18 | Time of maximum wind gust | 3 |
| 19 | 9am Temperature (°C) | 1 |
| 20 | 9am relative humidity (%) | 1 |
| 21 | 9am cloud amount (oktas) | 0 |
| 22 | 9am wind direction | 1 |
| 23 | 9am wind speed (km/h) | 1 |
| 24 | 9am MSL pressure (hPa) | 2 |
| 25 | 3pm Temperature (°C) | 0 |
| 26 | 3pm relative humidity (%) | 0 |
| 27 | 3pm cloud amount (oktas) | 1 |
| 28 | 3pm wind direction | 0 |
| 29 | 3pm wind speed (km/h) | 0 |
| 30 | 3pm MSL pressure (hPa) | 1 |
# only 6 rows contain missing values, so we simply drop them
merged.dropna(inplace=True)
# continue cleaning: standardize column names to title case
merged.columns = merged.columns.str.title()
display(merged.head(2))
| Date | Number Of Extreme | Number Of High | Number Of Low | Number Of Medium | Daily Use | Month | Weekday | Summer | Autumn | Winter | Minimum Temperature (°C) | Maximum Temperature (°C) | Rainfall (Mm) | Evaporation (Mm) | Sunshine (Hours) | Direction Of Maximum Wind Gust | Speed Of Maximum Wind Gust (Km/H) | Time Of Maximum Wind Gust | 9Am Temperature (°C) | 9Am Relative Humidity (%) | 9Am Cloud Amount (Oktas) | 9Am Wind Direction | 9Am Wind Speed (Km/H) | 9Am Msl Pressure (Hpa) | 3Pm Temperature (°C) | 3Pm Relative Humidity (%) | 3Pm Cloud Amount (Oktas) | 3Pm Wind Direction | 3Pm Wind Speed (Km/H) | 3Pm Msl Pressure (Hpa) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-02 | 0 | 0 | 48 | 0 | 197990.13 | 1 | 0 | 1 | 0 | 0 | 18.4 | 29.0 | 0.0 | 9.4 | 1.3 | NNW | 30.0 | 8:22 | 23.3 | 52.0 | 7 | NNW | 17 | 1013.3 | 28.7 | 38 | 7.0 | SW | 4 | 1008.5 |
| 1 | 2021-01-03 | 0 | 0 | 48 | 0 | 188742.96 | 1 | 0 | 1 | 0 | 0 | 17.0 | 26.2 | 12.6 | 4.8 | 7.1 | WSW | 33.0 | 17:55 | 18.3 | 100.0 | 8 | WSW | 4 | 1007.7 | 23.5 | 59 | 4.0 | SSW | 2 | 1005.2 |
#check data type
dtypes_m = pd.DataFrame(merged.dtypes, columns=['Type'])
dtypes_m1 = dtypes_m.T
display(Markdown("#### Data types of 'merged' "))
display(dtypes_m1)
| Date | Minimum temperature (°C) | Maximum temperature (°C) | Rainfall (mm) | Evaporation (mm) | Sunshine (hours) | Direction of maximum wind gust | Speed of maximum wind gust (km/h) | Time of maximum wind gust | 9am Temperature (°C) | 9am relative humidity (%) | 9am cloud amount (oktas) | 9am wind direction | 9am wind speed (km/h) | 9am MSL pressure (hPa) | 3pm Temperature (°C) | 3pm relative humidity (%) | 3pm cloud amount (oktas) | 3pm wind direction | 3pm wind speed (km/h) | 3pm MSL pressure (hPa) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Type | object | float64 | float64 | float64 | float64 | float64 | object | float64 | object | float64 | float64 | int64 | object | object | float64 | float64 | int64 | float64 | object | object | float64 |
1 Merged the 'price_demand_9' and 'weather_1' datasets.
2 Removed the 'Date' variable, as it duplicated the information in 'DATE'.
3 Addressed missing values. We initially tried moving-average and median/mode imputation, evaluating the estimates against actual values, but they deviated substantially: these methods are insensitive to large swings, while energy usage fluctuates dramatically over time. Given that only six rows (less than 3%) contain missing values, we opted to simply remove them.
4 Standardized all column names to title case.
After merging and cleaning, the 'merged' dataset has 236 observations and 31 variables.
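The rolling-mean imputation we trialled (item 3) can be sketched as follows, on a hypothetical series with one gap; the centered window mean skips NaNs and is used to fill only the missing slots:

```python
import pandas as pd

# hypothetical daily series with one missing value
s = pd.Series([10.0, 12.0, None, 16.0, 18.0])

# centered 3-point rolling mean; NaNs are skipped when averaging
rolling = s.rolling(window=3, center=True, min_periods=1).mean()

# replace only the originally missing positions
filled = s.fillna(rolling)
print(filled.tolist())  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

On smooth series this works well; on series with sharp day-to-day swings, as here, the window mean lags the true values, which is why we discarded the approach.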
import matplotlib.pyplot as plt
from collections import Counter
# count occurrences of each price category
counts = price_demand['PRICECATEGORY'].value_counts()
# create a bar plot
counts.plot(kind='bar')
# add labels and title
plt.xlabel('Price Category')
plt.ylabel('Count')
plt.title('Price Category Counts')
# display the plot
plt.show()
import plotly.express as px
#plot line chart
fig1 = px.line(price_demand_9, x='DATE', y='Daily use',title='Daily Use by Date')
fig1.show()
The line chart above shows fluctuations over time, with pronounced upward and downward swings. Consumption was relatively stable between March and June, whereas January, February, July, and August exhibited more substantial variation in energy usage.
import plotly.express as px
#draw scatter plot
fig2 = px.scatter(price_demand_7, x='DATE', y='Daily use', color='type of day',title='Daily Use by Date and Type of Day')
fig2.show()
The scatter plot shows distinct differences in energy usage between weekdays and weekends: consumption generally decreases on weekends. Two weekends, however, namely 24/01 and 20/02, show anomalously high energy usage.
merged.drop(labels=['Date', 'Direction Of Maximum Wind Gust', '9Am Wind Speed (Km/H)', '3Pm Wind Speed (Km/H)', '9Am Wind Direction', '3Pm Wind Direction', 'Time Of Maximum Wind Gust'], inplace=True, axis=1)
import seaborn as sns
import matplotlib.pyplot as plt
#create scatter plot matrix
x_cols = merged.columns.to_list()
plot_kws = {'color': 'black', 'alpha': 0.3}
sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[:4], plot_kws=plot_kws)
sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[5:9], plot_kws=plot_kws)
sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[9:13], plot_kws=plot_kws)
sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[13:17], plot_kws=plot_kws)
sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[17:21], plot_kws=plot_kws)
sns.pairplot(merged, y_vars=['Daily Use'], x_vars=x_cols[21:24], plot_kws=plot_kws)
sns.set(font_scale=0.5)
display(Markdown("### Scatter Plot Matrix of Daily Use vs. Other Variables"))
plt.show()
The variables 'Minimum Temperature (°C)', 'Maximum Temperature (°C)', '9Am Temperature (°C)' and '3Pm Temperature (°C)' exhibit a distinctive V-shaped pattern against 'Daily Use', indicating higher energy demand at extreme temperatures.
Note that in summer energy demand rises with temperature, while in winter it increases as temperature falls, reflecting the seasons' impact on consumption patterns.
The relationship between temperature and energy consumption is therefore not linear: moderate temperatures reduce usage while extreme temperatures increase it. Multiple regression with polynomial terms can capture this non-linearity.
Furthermore, a clear positive linear correlation exists between 'Month' and energy consumption, but relationships with other factors need more investigation.
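To illustrate, a quadratic term is enough for ordinary least squares to recover a V/U-shaped temperature response; a sketch on purely synthetic data (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
temp = np.linspace(0, 35, 50)
# synthetic demand: lowest near a mild 18 °C, rising toward both extremes
demand = 200_000 + 150 * (temp - 18) ** 2 + rng.normal(0, 500, 50)

# fit demand ~ a*temp^2 + b*temp + c; the squared term captures the U shape
a, b, c = np.polyfit(temp, demand, deg=2)
comfort_temp = -b / (2 * a)  # vertex of the fitted parabola
print(round(comfort_temp, 1))
```

The fitted vertex lands near the 18 °C minimum built into the synthetic data; a plain linear fit would miss the shape entirely.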
Our initial feature selection excludes the following variables:
1 'Date': Instead of using this variable, we incorporated 'month', 'weekday', and 'season' to better capture the relationship between time and daily energy usage.
2 '3Pm Wind Speed (Km/H)' and '9Am Wind Speed (Km/H)': We excluded these variables due to undefined string values, such as 'calm'.
3 'Direction Of Maximum Wind Gust', '9Am Wind Direction', '3Pm Wind Direction': With over 10 unique wind direction values, creating new columns for each is impractical, so we omitted these variables.
4 'Time Of Maximum Wind Gust': Our analysis indicated that this feature is unlikely to significantly influence daily energy usage, so we excluded it from the model.
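The cardinality concern in item 3 can be checked directly with `nunique`; a sketch on hypothetical compass readings:

```python
import pandas as pd

directions = pd.Series(["N", "NNE", "NE", "ENE", "E", "SE", "S", "SSW",
                        "SW", "WSW", "W", "NW", "NNW", "N", "SW"])
# each unique direction would need its own dummy column if one-hot encoded
print(directions.nunique())  # 13
```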
#display the data
merged.head()
| Number Of Extreme | Number Of High | Number Of Low | Number Of Medium | Daily Use | Month | Weekday | Summer | Autumn | Winter | Minimum Temperature (°C) | Maximum Temperature (°C) | Rainfall (Mm) | Evaporation (Mm) | Sunshine (Hours) | Speed Of Maximum Wind Gust (Km/H) | 9Am Temperature (°C) | 9Am Relative Humidity (%) | 9Am Cloud Amount (Oktas) | 9Am Msl Pressure (Hpa) | 3Pm Temperature (°C) | 3Pm Relative Humidity (%) | 3Pm Cloud Amount (Oktas) | 3Pm Msl Pressure (Hpa) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 48 | 0 | 197990.13 | 1 | 0 | 1 | 0 | 0 | 18.4 | 29.0 | 0.0 | 9.4 | 1.3 | 30.0 | 23.3 | 52.0 | 7 | 1013.3 | 28.7 | 38 | 7.0 | 1008.5 |
| 1 | 0 | 0 | 48 | 0 | 188742.96 | 1 | 0 | 1 | 0 | 0 | 17.0 | 26.2 | 12.6 | 4.8 | 7.1 | 33.0 | 18.3 | 100.0 | 8 | 1007.7 | 23.5 | 59 | 4.0 | 1005.2 |
| 2 | 0 | 0 | 48 | 0 | 199281.07 | 1 | 1 | 1 | 0 | 0 | 16.0 | 18.6 | 2.6 | 3.8 | 0.0 | 41.0 | 16.2 | 98.0 | 8 | 1010.0 | 18.2 | 82 | 8.0 | 1011.0 |
| 3 | 0 | 0 | 48 | 0 | 207680.91 | 1 | 1 | 1 | 0 | 0 | 15.9 | 19.1 | 11.2 | 1.0 | 0.0 | 35.0 | 17.2 | 96.0 | 8 | 1012.5 | 18.2 | 82 | 8.0 | 1013.3 |
| 4 | 0 | 0 | 48 | 0 | 201497.24 | 1 | 1 | 1 | 0 | 0 | 13.7 | 19.2 | 1.2 | 1.0 | 3.2 | 35.0 | 15.2 | 72.0 | 7 | 1020.0 | 18.1 | 63 | 7.0 | 1020.0 |
import pandas as pd
# display correlation matrix
cormat = merged.corr()
round(cormat,2)
| Number Of Extreme | Number Of High | Number Of Low | Number Of Medium | Daily Use | Month | Weekday | Summer | Autumn | Winter | Minimum Temperature (°C) | Maximum Temperature (°C) | Rainfall (Mm) | Evaporation (Mm) | Sunshine (Hours) | Speed Of Maximum Wind Gust (Km/H) | 9Am Temperature (°C) | 9Am Relative Humidity (%) | 9Am Cloud Amount (Oktas) | 9Am Msl Pressure (Hpa) | 3Pm Temperature (°C) | 3Pm Relative Humidity (%) | 3Pm Cloud Amount (Oktas) | 3Pm Msl Pressure (Hpa) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number Of Extreme | 1.00 | 0.67 | -0.50 | 0.17 | 0.34 | 0.24 | 0.03 | -0.20 | -0.05 | 0.22 | -0.33 | -0.29 | 0.02 | -0.29 | -0.15 | -0.29 | -0.34 | 0.22 | 0.04 | 0.15 | -0.28 | 0.13 | 0.00 | 0.16 |
| Number Of High | 0.67 | 1.00 | -0.75 | 0.42 | 0.54 | 0.39 | 0.11 | -0.32 | -0.17 | 0.45 | -0.48 | -0.44 | -0.03 | -0.41 | -0.23 | -0.24 | -0.48 | 0.30 | -0.01 | 0.12 | -0.43 | 0.21 | 0.06 | 0.12 |
| Number Of Low | -0.50 | -0.75 | 1.00 | -0.91 | -0.65 | -0.66 | -0.07 | 0.54 | 0.14 | -0.62 | 0.65 | 0.60 | -0.01 | 0.59 | 0.34 | 0.24 | 0.63 | -0.39 | 0.03 | -0.23 | 0.60 | -0.28 | -0.11 | -0.22 |
| Number Of Medium | 0.17 | 0.42 | -0.91 | 1.00 | 0.56 | 0.66 | 0.04 | -0.54 | -0.11 | 0.59 | -0.60 | -0.56 | 0.02 | -0.55 | -0.32 | -0.16 | -0.57 | 0.34 | -0.05 | 0.23 | -0.55 | 0.25 | 0.13 | 0.21 |
| Daily Use | 0.34 | 0.54 | -0.65 | 0.56 | 1.00 | 0.52 | 0.46 | -0.33 | -0.25 | 0.54 | -0.46 | -0.34 | -0.06 | -0.28 | -0.30 | 0.09 | -0.38 | 0.17 | 0.00 | -0.02 | -0.36 | 0.19 | 0.18 | -0.05 |
| Month | 0.24 | 0.39 | -0.66 | 0.66 | 0.52 | 1.00 | -0.01 | -0.75 | -0.18 | 0.84 | -0.75 | -0.69 | -0.03 | -0.56 | -0.27 | 0.14 | -0.73 | 0.23 | -0.12 | 0.11 | -0.67 | 0.12 | 0.10 | 0.08 |
| Weekday | 0.03 | 0.11 | -0.07 | 0.04 | 0.46 | -0.01 | 1.00 | -0.01 | 0.02 | -0.01 | 0.00 | -0.03 | -0.07 | -0.04 | -0.08 | 0.08 | 0.01 | 0.01 | 0.13 | 0.01 | -0.01 | 0.02 | 0.06 | -0.02 |
| Summer | -0.20 | -0.32 | 0.54 | -0.54 | -0.33 | -0.75 | -0.01 | 1.00 | -0.45 | -0.44 | 0.61 | 0.58 | 0.01 | 0.59 | 0.32 | 0.04 | 0.61 | -0.29 | 0.10 | -0.23 | 0.56 | -0.17 | -0.10 | -0.22 |
| Autumn | -0.05 | -0.17 | 0.14 | -0.11 | -0.25 | -0.18 | 0.02 | -0.45 | 1.00 | -0.61 | 0.09 | 0.09 | 0.02 | -0.08 | -0.07 | -0.25 | 0.08 | 0.03 | -0.04 | 0.27 | 0.09 | -0.02 | 0.00 | 0.29 |
| Winter | 0.22 | 0.45 | -0.62 | 0.59 | 0.54 | 0.84 | -0.01 | -0.44 | -0.61 | 1.00 | -0.64 | -0.60 | -0.03 | -0.43 | -0.22 | 0.22 | -0.62 | 0.23 | -0.05 | -0.07 | -0.59 | 0.17 | 0.09 | -0.09 |
| Minimum Temperature (°C) | -0.33 | -0.48 | 0.65 | -0.60 | -0.46 | -0.75 | 0.00 | 0.61 | 0.09 | -0.64 | 1.00 | 0.70 | 0.04 | 0.66 | 0.10 | 0.03 | 0.91 | -0.32 | 0.19 | -0.32 | 0.66 | -0.05 | 0.05 | -0.28 |
| Maximum Temperature (°C) | -0.29 | -0.44 | 0.60 | -0.56 | -0.34 | -0.69 | -0.03 | 0.58 | 0.09 | -0.60 | 0.70 | 1.00 | -0.13 | 0.63 | 0.48 | -0.07 | 0.82 | -0.31 | -0.19 | -0.09 | 0.97 | -0.46 | -0.25 | -0.18 |
| Rainfall (Mm) | 0.02 | -0.03 | -0.01 | 0.02 | -0.06 | -0.03 | -0.07 | 0.01 | 0.02 | -0.03 | 0.04 | -0.13 | 1.00 | -0.05 | -0.15 | 0.04 | -0.03 | 0.20 | 0.07 | -0.14 | -0.13 | 0.19 | 0.11 | -0.06 |
| Evaporation (Mm) | -0.29 | -0.41 | 0.59 | -0.55 | -0.28 | -0.56 | -0.04 | 0.59 | -0.08 | -0.43 | 0.66 | 0.63 | -0.05 | 1.00 | 0.29 | 0.14 | 0.71 | -0.52 | 0.00 | -0.27 | 0.56 | -0.21 | -0.10 | -0.25 |
| Sunshine (Hours) | -0.15 | -0.23 | 0.34 | -0.32 | -0.30 | -0.27 | -0.08 | 0.32 | -0.07 | -0.22 | 0.10 | 0.48 | -0.15 | 0.29 | 1.00 | -0.04 | 0.22 | -0.26 | -0.59 | 0.21 | 0.50 | -0.56 | -0.72 | 0.14 |
| Speed Of Maximum Wind Gust (Km/H) | -0.29 | -0.24 | 0.24 | -0.16 | 0.09 | 0.14 | 0.08 | 0.04 | -0.25 | 0.22 | 0.03 | -0.07 | 0.04 | 0.14 | -0.04 | 1.00 | 0.08 | -0.37 | 0.00 | -0.42 | -0.12 | -0.08 | 0.13 | -0.43 |
| 9Am Temperature (°C) | -0.34 | -0.48 | 0.63 | -0.57 | -0.38 | -0.73 | 0.01 | 0.61 | 0.08 | -0.62 | 0.91 | 0.82 | -0.03 | 0.71 | 0.22 | 0.08 | 1.00 | -0.46 | 0.05 | -0.27 | 0.76 | -0.16 | -0.01 | -0.28 |
| 9Am Relative Humidity (%) | 0.22 | 0.30 | -0.39 | 0.34 | 0.17 | 0.23 | 0.01 | -0.29 | 0.03 | 0.23 | -0.32 | -0.31 | 0.20 | -0.52 | -0.26 | -0.37 | -0.46 | 1.00 | 0.08 | 0.13 | -0.26 | 0.40 | 0.04 | 0.14 |
| 9Am Cloud Amount (Oktas) | 0.04 | -0.01 | 0.03 | -0.05 | 0.00 | -0.12 | 0.13 | 0.10 | -0.04 | -0.05 | 0.19 | -0.19 | 0.07 | 0.00 | -0.59 | 0.00 | 0.05 | 0.08 | 1.00 | -0.19 | -0.20 | 0.43 | 0.39 | -0.12 |
| 9Am Msl Pressure (Hpa) | 0.15 | 0.12 | -0.23 | 0.23 | -0.02 | 0.11 | 0.01 | -0.23 | 0.27 | -0.07 | -0.32 | -0.09 | -0.14 | -0.27 | 0.21 | -0.42 | -0.27 | 0.13 | -0.19 | 1.00 | -0.03 | -0.14 | -0.29 | 0.96 |
| 3Pm Temperature (°C) | -0.28 | -0.43 | 0.60 | -0.55 | -0.36 | -0.67 | -0.01 | 0.56 | 0.09 | -0.59 | 0.66 | 0.97 | -0.13 | 0.56 | 0.50 | -0.12 | 0.76 | -0.26 | -0.20 | -0.03 | 1.00 | -0.55 | -0.28 | -0.13 |
| 3Pm Relative Humidity (%) | 0.13 | 0.21 | -0.28 | 0.25 | 0.19 | 0.12 | 0.02 | -0.17 | -0.02 | 0.17 | -0.05 | -0.46 | 0.19 | -0.21 | -0.56 | -0.08 | -0.16 | 0.40 | 0.43 | -0.14 | -0.55 | 1.00 | 0.35 | -0.03 |
| 3Pm Cloud Amount (Oktas) | 0.00 | 0.06 | -0.11 | 0.13 | 0.18 | 0.10 | 0.06 | -0.10 | 0.00 | 0.09 | 0.05 | -0.25 | 0.11 | -0.10 | -0.72 | 0.13 | -0.01 | 0.04 | 0.39 | -0.29 | -0.28 | 0.35 | 1.00 | -0.23 |
| 3Pm Msl Pressure (Hpa) | 0.16 | 0.12 | -0.22 | 0.21 | -0.05 | 0.08 | -0.02 | -0.22 | 0.29 | -0.09 | -0.28 | -0.18 | -0.06 | -0.25 | 0.14 | -0.43 | -0.28 | 0.14 | -0.12 | 0.96 | -0.13 | -0.03 | -0.23 | 1.00 |
For feature selection, we used the correlation coefficient method to measure the linear association between the target variable and each individual feature. Features with the highest absolute correlations are the most relevant candidates for our analysis.
Applying the correlation matrix above with an absolute-value threshold of 0.3, the following features are deemed less significant for our regression model predicting energy usage: 'Rainfall (mm)', 'Evaporation (mm)', 'Speed of Maximum Wind Gust (km/h)', '9AM Relative Humidity (%)', '9AM Cloud Amount (oktas)', '9AM MSL Pressure (hPa)', '3PM Relative Humidity (%)', '3PM Cloud Amount (oktas)', and '3PM MSL Pressure (hPa)'.
It should also be noted that a multicollinearity concern exists: substantial correlations are observed among several predictor variables, particularly within related groups such as the temperature variables, the price-category counts, and the time-related variables. To improve performance and interpretability, we remove some of these correlated predictors in the next section.
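As a minimal sketch of this thresholding step (with made-up values standing in for `merged`, for illustration only):

```python
import pandas as pd

# Toy frame standing in for `merged` (hypothetical values).
df = pd.DataFrame({
    "Daily Use": [10.0, 12.0, 14.0, 13.0, 18.0, 20.0],
    "Max Temp": [30.0, 28.0, 22.0, 21.0, 15.0, 12.0],
    "Rainfall": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
})

# Absolute correlation of each predictor with the target.
corr_with_target = df.corr()["Daily Use"].drop("Daily Use").abs()

# Keep only predictors meeting the |r| >= 0.3 threshold used above.
selected = corr_with_target[corr_with_target >= 0.3].index.tolist()
print(selected)  # ['Max Temp']
```

Here 'Max Temp' correlates strongly with 'Daily Use', while the weakly correlated 'Rainfall' falls below the threshold and is dropped.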
import numpy as np
merged_1 = merged.copy()
#create cubic polynomial terms
merged_1["Min Temp Cubed"] = merged_1["Minimum Temperature (°C)"] ** 3
merged_1["Max Temp Cubed"] = merged_1["Maximum Temperature (°C)"] ** 3
merged_1["9am Temp Cubed"] = merged_1["9Am Temperature (°C)"] ** 3
merged_1["3pm Temp Cubed"] = merged_1["3Pm Temperature (°C)"] ** 3
#create interaction terms
merged_1['Month:MinTem'] = merged_1['Month'] * merged_1['Minimum Temperature (°C)']
merged_1['Month:MaxTem'] = merged_1['Month'] * merged_1['Maximum Temperature (°C)']
To improve the predictive power of the regression model, we created two types of new features:
1 Polynomial terms: to better represent the non-linear V-shaped pattern between energy use and the temperature-related variables, we generated cubic terms for four predictors: 'Min Temp Cubed', 'Max Temp Cubed', '9am Temp Cubed', and '3pm Temp Cubed'.
2 Interaction terms: to account for the way the effect of temperature on energy demand changes across the year, we multiplied 'Month' by the minimum and maximum temperatures. This yielded two new features, 'Month:MinTem' and 'Month:MaxTem', which capture the varying impact of temperature on energy usage by month.
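A small illustration of a property the cubic terms rely on: unlike squaring, cubing grows the magnitude non-linearly while preserving the sign, so cold and hot temperature values remain distinguishable to the linear model.

```python
# Cubing preserves sign while growing magnitude non-linearly.
temps = [-5.0, 0.0, 5.0, 25.0]
cubed = [t ** 3 for t in temps]
print(cubed)  # [-125.0, 0.0, 125.0, 15625.0]
```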
#display data
merged_1.head()
| Number Of Extreme | Number Of High | Number Of Low | Number Of Medium | Daily Use | Month | Weekday | Summer | Autumn | Winter | Minimum Temperature (°C) | Maximum Temperature (°C) | Rainfall (Mm) | Evaporation (Mm) | Sunshine (Hours) | Speed Of Maximum Wind Gust (Km/H) | 9Am Temperature (°C) | 9Am Relative Humidity (%) | 9Am Cloud Amount (Oktas) | 9Am Msl Pressure (Hpa) | 3Pm Temperature (°C) | 3Pm Relative Humidity (%) | 3Pm Cloud Amount (Oktas) | 3Pm Msl Pressure (Hpa) | Min Temp Cubed | Max Temp Cubed | 9am Temp Cubed | 3pm Temp Cubed | Month:MinTem | Month:MaxTem | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 48 | 0 | 197990.13 | 1 | 0 | 1 | 0 | 0 | 18.4 | 29.0 | 0.0 | 9.4 | 1.3 | 30.0 | 23.3 | 52.0 | 7 | 1013.3 | 28.7 | 38 | 7.0 | 1008.5 | 6229.504 | 24389.000 | 12649.337 | 23639.903 | 18.4 | 29.0 |
| 1 | 0 | 0 | 48 | 0 | 188742.96 | 1 | 0 | 1 | 0 | 0 | 17.0 | 26.2 | 12.6 | 4.8 | 7.1 | 33.0 | 18.3 | 100.0 | 8 | 1007.7 | 23.5 | 59 | 4.0 | 1005.2 | 4913.000 | 17984.728 | 6128.487 | 12977.875 | 17.0 | 26.2 |
| 2 | 0 | 0 | 48 | 0 | 199281.07 | 1 | 1 | 1 | 0 | 0 | 16.0 | 18.6 | 2.6 | 3.8 | 0.0 | 41.0 | 16.2 | 98.0 | 8 | 1010.0 | 18.2 | 82 | 8.0 | 1011.0 | 4096.000 | 6434.856 | 4251.528 | 6028.568 | 16.0 | 18.6 |
| 3 | 0 | 0 | 48 | 0 | 207680.91 | 1 | 1 | 1 | 0 | 0 | 15.9 | 19.1 | 11.2 | 1.0 | 0.0 | 35.0 | 17.2 | 96.0 | 8 | 1012.5 | 18.2 | 82 | 8.0 | 1013.3 | 4019.679 | 6967.871 | 5088.448 | 6028.568 | 15.9 | 19.1 |
| 4 | 0 | 0 | 48 | 0 | 201497.24 | 1 | 1 | 1 | 0 | 0 | 13.7 | 19.2 | 1.2 | 1.0 | 3.2 | 35.0 | 15.2 | 72.0 | 7 | 1020.0 | 18.1 | 63 | 7.0 | 1020.0 | 2571.353 | 7077.888 | 3511.808 | 5929.741 | 13.7 | 19.2 |
#check dimension
rows, columns = merged_1.shape
data6 = {'Rows': [rows], 'Columns': [columns]}
dimensions_table4 = pd.DataFrame(data6)
display(Markdown("#### Dimensions of merged_1"))
display(dimensions_table4)
| Rows | Columns | |
|---|---|---|
| 0 | 236 | 30 |
The dataset 'merged_1' now has 236 observations and 30 variables, with no missing values.
Our process for computing the regression models includes the following steps:
1 Split the dataset: divide the data into training and testing subsets for model building and evaluation.
2 Set random seeds: ensure consistent, reproducible results by fixing random seeds before training.
3 Standardize features: scale and standardize the features for better model performance.
4 Select features: use automated univariate selection (SelectKBest with F-test scores) or the correlation coefficient method to identify the most important features for the regression models.
5 Use cross-validation: apply 5-fold cross-validation to reduce bias and obtain reliable performance estimates.
6 Evaluate error metrics: assess model performance with RMSE, MAE, and MAPE for different feature counts (K).
7 Examine statistics: analyse the statistical summary, focusing on coefficients, p-values, and goodness-of-fit metrics.
8 Compare models: review the regression models' performance and fit to select the most appropriate model.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_selection import SelectKBest, f_regression
from IPython.display import display
# define a helper function to calculate MAPE
def mean_absolute_percentage_error(y_true, y_pred):
y_true, y_pred = np.array(y_true), np.array(y_pred)
return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
# set up dependent and independent variables
columns_to_drop = ['Daily Use']
X0 = merged.drop(columns_to_drop, axis=1)
y0 = merged['Daily Use']
# standardize the features except 'Weekday'
scaler = StandardScaler()
X0_scaled = X0.copy()
X0_scaled.loc[:, X0_scaled.columns != 'Weekday'] = scaler.fit_transform(X0.loc[:, X0.columns != 'Weekday'])
results0 = []
# compute the error metrics for each k in the selected range
for k in range(1, 24):
selector = SelectKBest(score_func=f_regression, k=k)
X0_best_features = selector.fit_transform(X0_scaled, y0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
lm = LinearRegression()
rmse_scores0 = []
mae_scores0 = []
mape_scores0 = []
r2_scores0 = []
adjusted_r2_scores0 = []
for train_index, test_index in kf.split(X0_best_features):
X_train0, X_test0= X0_best_features[train_index], X0_best_features[test_index]
y_train0, y_test0 = y0.iloc[train_index], y0.iloc[test_index]
model_0 = lm.fit(X_train0, y_train0)
y_pred0 = lm.predict(X_test0)
rmse0 = np.sqrt(mean_squared_error(y_test0, y_pred0))
mae0 = mean_absolute_error(y_test0, y_pred0)
mape0 = mean_absolute_percentage_error(y_test0, y_pred0)
rmse_scores0.append(rmse0)
mae_scores0.append(mae0)
mape_scores0.append(mape0)
avg_rmse0 = np.mean(rmse_scores0)
avg_mae0 = np.mean(mae_scores0)
avg_mape0 = np.mean(mape_scores0)
results0.append([k, avg_rmse0, avg_mae0, avg_mape0])
# create a DataFrame with the results
columns = ['K', 'Average RMSE', 'Average MAE', 'Average MAPE']
results_df0 = pd.DataFrame(results0, columns=columns)
# display the results as a table
display(Markdown("#### Model_0: Error Metric with K"))
display(results_df0)
| K | Average RMSE | Average MAE | Average MAPE | |
|---|---|---|---|---|
| 0 | 1 | 19827.694309 | 15755.788963 | 6.785999 |
| 1 | 2 | 19812.484195 | 15845.667944 | 6.825048 |
| 2 | 3 | 19847.301024 | 15934.037434 | 6.863746 |
| 3 | 4 | 19337.849251 | 15222.538257 | 6.572814 |
| 4 | 5 | 19450.204539 | 15231.908482 | 6.576412 |
| 5 | 6 | 15675.625110 | 11906.911437 | 5.087793 |
| 6 | 7 | 15775.166358 | 11981.998545 | 5.115063 |
| 7 | 8 | 15376.484923 | 11987.946208 | 5.116251 |
| 8 | 9 | 15351.687733 | 11969.679931 | 5.108056 |
| 9 | 10 | 15351.687733 | 11969.679931 | 5.108056 |
| 10 | 11 | 15055.655637 | 11773.906589 | 5.017770 |
| 11 | 12 | 15063.412122 | 11834.740366 | 5.054922 |
| 12 | 13 | 14494.496705 | 11334.305675 | 4.842552 |
| 13 | 14 | 14368.984700 | 11406.994907 | 4.870260 |
| 14 | 15 | 14364.559931 | 11394.036569 | 4.866781 |
| 15 | 16 | 14559.703465 | 11438.526684 | 4.874333 |
| 16 | 17 | 14574.710206 | 11496.693956 | 4.899573 |
| 17 | 18 | 14589.988414 | 11525.274231 | 4.910701 |
| 18 | 19 | 14222.189128 | 11108.421988 | 4.739425 |
| 19 | 20 | 14232.893748 | 11119.996322 | 4.744972 |
| 20 | 21 | 14334.217792 | 11224.352618 | 4.788702 |
| 21 | 22 | 14304.313767 | 11278.616569 | 4.815530 |
| 22 | 23 | 14275.852917 | 11258.095010 | 4.809062 |
Initially, we executed model_0 using the variables without polynomial and interaction terms, selecting features automatically with SelectKBest (univariate F-test scores). Selecting a smaller set of predictors in this way helps reduce overfitting.
We then identified the optimal K value by evaluating the error metrics: the average Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE).
The most favorable model is achieved at K = 19.
We keep using SelectKBest, rather than the manual correlation-coefficient method, for the feature selection of model_2, since our primary goal is predictive accuracy and we are less concerned with interpreting the relationship between the original predictors and the response variable.
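The K selection above can be sketched as follows, using a small slice of the error-metric table (values copied from it):

```python
import pandas as pd

# A small slice of the model_0 error-metric table above.
results = pd.DataFrame({
    "K": [18, 19, 20],
    "Average RMSE": [14589.988414, 14222.189128, 14232.893748],
})

# The optimal K minimises the average RMSE across the folds.
best_k = int(results.loc[results["Average RMSE"].idxmin(), "K"])
print(best_k)  # 19
```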
# define a helper function to calculate MAPE
def mean_absolute_percentage_error(y_true, y_pred):
y_true, y_pred = np.array(y_true), np.array(y_pred)
return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
# set up dependent and independent variables
columns_to_drop = ['Daily Use']
X2 = merged_1.drop(columns_to_drop, axis=1)
y2 = merged_1['Daily Use']
# Standardize the features except 'Weekday'
scaler = StandardScaler()
X2_scaled = X2.copy()
X2_scaled.loc[:, X2_scaled.columns != 'Weekday'] = scaler.fit_transform(X2.loc[:, X2.columns != 'Weekday'])
results2 = []
for k in range(1, 30):
selector = SelectKBest(score_func=f_regression, k=k)
X2_best_features = selector.fit_transform(X2_scaled, y2)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
lm = LinearRegression()
rmse_scores2 = []
mae_scores2 = []
mape_scores2 = []
for train_index, test_index in kf.split(X2_best_features):
X_train2, X_test2 = X2_best_features[train_index], X2_best_features[test_index]
y_train2, y_test2 = y2.iloc[train_index], y2.iloc[test_index]
model_2 = lm.fit(X_train2, y_train2)
y_pred2 = lm.predict(X_test2)
rmse2 = np.sqrt(mean_squared_error(y_test2, y_pred2))
mae2 = mean_absolute_error(y_test2, y_pred2)
mape2 = mean_absolute_percentage_error(y_test2, y_pred2)
rmse_scores2.append(rmse2)
mae_scores2.append(mae2)
mape_scores2.append(mape2)
avg_rmse2 = np.mean(rmse_scores2)
avg_mae2 = np.mean(mae_scores2)
avg_mape2 = np.mean(mape_scores2)
results2.append([k, avg_rmse2, avg_mae2, avg_mape2])
# Create a DataFrame with the results
columns = ['K', 'Average RMSE', 'Average MAE', 'Average MAPE']
results_df2 = pd.DataFrame(results2, columns=columns)
# Display the results as a table
display(Markdown("#### Model_2: Error Metric with K"))
display(results_df2)
| K | Average RMSE | Average MAE | Average MAPE | |
|---|---|---|---|---|
| 0 | 1 | 19827.694309 | 15755.788963 | 6.785999 |
| 1 | 2 | 19812.484195 | 15845.667944 | 6.825048 |
| 2 | 3 | 19847.301024 | 15934.037434 | 6.863746 |
| 3 | 4 | 19337.849251 | 15222.538257 | 6.572814 |
| 4 | 5 | 19450.204539 | 15231.908482 | 6.576412 |
| 5 | 6 | 15675.625110 | 11906.911437 | 5.087793 |
| 6 | 7 | 15775.166358 | 11981.998545 | 5.115063 |
| 7 | 8 | 15376.484923 | 11987.946208 | 5.116251 |
| 8 | 9 | 15351.687733 | 11969.679931 | 5.108056 |
| 9 | 10 | 15351.687733 | 11969.679931 | 5.108056 |
| 10 | 11 | 15055.655637 | 11773.906589 | 5.017770 |
| 11 | 12 | 11986.739358 | 9368.973404 | 4.005628 |
| 12 | 13 | 11890.785749 | 9188.517916 | 3.930748 |
| 13 | 14 | 11189.916314 | 8630.867520 | 3.677548 |
| 14 | 15 | 10884.574017 | 8438.629805 | 3.590851 |
| 15 | 16 | 10901.983584 | 8459.681882 | 3.600329 |
| 16 | 17 | 10904.313575 | 8447.304293 | 3.597428 |
| 17 | 18 | 10608.828021 | 8068.359296 | 3.433803 |
| 18 | 19 | 10481.079928 | 8029.774965 | 3.417644 |
| 19 | 20 | 10470.849285 | 8032.238976 | 3.418831 |
| 20 | 21 | 9688.956847 | 7460.399061 | 3.197219 |
| 21 | 22 | 9687.305503 | 7496.259981 | 3.211278 |
| 22 | 23 | 9681.640492 | 7523.572427 | 3.219184 |
| 23 | 24 | 9822.021516 | 7518.552680 | 3.210974 |
| 24 | 25 | 9695.099291 | 7540.174332 | 3.228786 |
| 25 | 26 | 9693.578367 | 7562.003919 | 3.241291 |
| 26 | 27 | 9841.822150 | 7685.961912 | 3.292416 |
| 27 | 28 | 9800.718967 | 7681.699673 | 3.297598 |
| 28 | 29 | 9696.357201 | 7519.900861 | 3.224159 |
In model_2, we incorporated the additional features, comprising both the polynomial and interaction terms, to enhance the model's predictive power.
The most favorable model is achieved at K = 21.
# create a table
data = {'Model': ['Model_2', 'Model_0'],
'RMSE': [9688.956847, 14222.189128],
'MAE': [7460.399061,11108.421988],
'MAPE': [3.197219, 4.739425]}
errormetric_compare = pd.DataFrame(data)
display(Markdown("#### Error Metric: Model_0 and Model_2"))
display(errormetric_compare)
| Model | RMSE | MAE | MAPE | |
|---|---|---|---|---|
| 0 | Model_2 | 9688.956847 | 7460.399061 | 3.197219 |
| 1 | Model_0 | 14222.189128 | 11108.421988 | 4.739425 |
Since the feature selection for these two models was performed automatically inside the cross-validation loop, we did not generate statistical summaries for them to compare statistics such as R-squared, AIC, BIC, or coefficients. To compare these two models, we focus on the error metrics only.
As the table above shows, predictive power improves significantly in model_2 after adding the interaction and polynomial terms.
Conclusion:
Model_2 has lower values on every error metric, which suggests that it is a better model than Model_0.
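The size of the gain can be quantified directly from the RMSE values in the table above:

```python
# Relative RMSE improvement of Model_2 over Model_0
# (values copied from the comparison table above).
rmse_0 = 14222.189128
rmse_2 = 9688.956847
improvement = (rmse_0 - rmse_2) / rmse_0 * 100
print(round(improvement, 1))  # 31.9
```

That is, the engineered features cut the average RMSE by roughly a third.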
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
#define selected predictors
columns = [ 'Number Of Low', 'Number Of Medium', 'Month', 'Weekday',
'Minimum Temperature (°C)', 'Speed Of Maximum Wind Gust (Km/H)',
'3Pm Relative Humidity (%)', 'Min Temp Cubed', 'Max Temp Cubed',
'3pm Temp Cubed', 'Month:MinTem', 'Month:MaxTem']
#define independent and dependent variables
X3 = merged_1[columns]
y3 = merged_1['Daily Use']
# standardize the features except 'Weekday'
scaler = StandardScaler()
X3_scaled = X3.copy()
X3_scaled.loc[:, X3_scaled.columns != 'Weekday'] = scaler.fit_transform(X3.loc[:, X3.columns != 'Weekday'])
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X3_scaled, y3, test_size=0.2, random_state=42)
X_train_const_3 = sm.add_constant(X_train_3)
# set up the OLS model
model_3 = sm.OLS(y_train_3, X_train_const_3)
# fit the model
fitted_model_3 = model_3.fit()
# get the statistical summary
statistical_summary_3 = fitted_model_3.summary()
display(Markdown("#### Model_3: Statistical Summary"))
print(statistical_summary_3)
OLS Regression Results
==============================================================================
Dep. Variable: Daily Use R-squared: 0.881
Model: OLS Adj. R-squared: 0.873
Method: Least Squares F-statistic: 107.8
Date: Sat, 15 Apr 2023 Prob (F-statistic): 3.96e-74
Time: 13:34:47 Log-Likelihood: -1977.7
No. Observations: 188 AIC: 3981.
Df Residuals: 175 BIC: 4024.
Df Model: 12
Covariance Type: nonrobust
=====================================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------------
const 2.187e+05 1256.167 174.068 0.000 2.16e+05 2.21e+05
Number Of Low -1.017e+04 1992.000 -5.103 0.000 -1.41e+04 -6234.009
Number Of Medium -3238.4336 1742.153 -1.859 0.065 -6676.769 199.902
Month 1.824e+04 4372.439 4.170 0.000 9605.728 2.69e+04
Weekday 2.446e+04 1521.511 16.076 0.000 2.15e+04 2.75e+04
Minimum Temperature (°C) -3.851e+04 5727.733 -6.723 0.000 -4.98e+04 -2.72e+04
Speed Of Maximum Wind Gust (Km/H) 1663.7720 855.265 1.945 0.053 -24.190 3351.734
3Pm Relative Humidity (%) 3948.8279 982.389 4.020 0.000 2009.973 5887.683
Min Temp Cubed 2.259e+04 3435.715 6.575 0.000 1.58e+04 2.94e+04
Max Temp Cubed 1.036e+04 2170.415 4.773 0.000 6075.859 1.46e+04
3pm Temp Cubed 9023.1650 2379.189 3.793 0.000 4327.568 1.37e+04
Month:MinTem 1.66e+04 3380.132 4.912 0.000 9933.299 2.33e+04
Month:MaxTem -2.664e+04 2622.478 -10.160 0.000 -3.18e+04 -2.15e+04
==============================================================================
Omnibus: 14.099 Durbin-Watson: 1.882
Prob(Omnibus): 0.001 Jarque-Bera (JB): 19.195
Skew: -0.484 Prob(JB): 6.79e-05
Kurtosis: 4.231 Cond. No. 28.1
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
To identify the best model using correlation-coefficient feature selection, we followed an approach consisting of the following steps:
1 We removed variables whose absolute correlation coefficient with the target was less than 0.3, to focus on the most relevant features.
2 To reduce the impact of multicollinearity on the model, we removed some of the highly correlated predictors.
3 We ran a linear regression model and examined the statistical summary to assess the significance of the features. Based on the coefficients and p-values, we adjusted the feature set and computed a revised model.
Model_3 allows us to make accurate predictions and draw meaningful insights from the data.
In analyzing model_3, we have focused on several key statistics.
R-squared and adjusted R-squared:
The R-squared value is 0.881, meaning the model explains 88.1% of the variance in the dependent variable. The adjusted R-squared value is 0.873; it accounts for the number of independent variables and is therefore a fairer measure of model fit.
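As a quick check, the adjusted R-squared can be reproduced from the figures reported in the summary:

```python
# Adjusted R-squared from the OLS summary above:
# n = 188 observations, p = 12 predictors, R-squared = 0.881.
r2, n, p = 0.881, 188, 12
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # 0.873
```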
F-statistic:
The F-statistic is 107.8, and the associated probability is very small (3.96e-74), which indicates that the model is statistically significant.
Coefficients:
The coefficient of each independent variable shows its effect on the dependent variable, holding all other variables constant. For example, since the predictors are standardized, a one-standard-deviation increase in "Number Of Low" is associated with a decrease in "Daily Use" of about 10,170 units, all else being equal.
P-values:
All variables are statistically significant at the 5% level except "Number Of Medium" (p = 0.065) and "Speed Of Maximum Wind Gust (Km/H)" (p = 0.053), which are marginally above 0.05.
Multicollinearity:
The condition number is 28.1, which does not indicate serious multicollinearity in the selected feature set.
# define the mean_absolute_percentage_error function
def mean_absolute_percentage_error(y_true, y_pred):
y_true, y_pred = np.array(y_true), np.array(y_pred)
return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
# prepare the data
columns = [ 'Number Of Low', 'Number Of Medium', 'Month', 'Weekday',
'Minimum Temperature (°C)', 'Speed Of Maximum Wind Gust (Km/H)',
'3Pm Relative Humidity (%)', 'Min Temp Cubed', 'Max Temp Cubed',
'3pm Temp Cubed', 'Month:MinTem', 'Month:MaxTem' ]
X3 = merged_1[columns]
y3 = merged_1['Daily Use']
# standardize the features except 'Weekday'
scaler = StandardScaler()
X3_scaled = X3.copy()
X3_scaled.loc[:, X3_scaled.columns != 'Weekday'] = scaler.fit_transform(X3.loc[:, X3.columns != 'Weekday'])
# split the data using KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
lm = LinearRegression()
rmse_scores3 = []
mae_scores3 = []
mape_scores3 = []
for train_index, test_index in kf.split(X3_scaled):
X_train3, X_test3 = X3_scaled.iloc[train_index], X3_scaled.iloc[test_index]
y_train3, y_test3 = y3.iloc[train_index], y3.iloc[test_index]
model_3 = lm.fit(X_train3, y_train3)
y_pred3 = lm.predict(X_test3)
rmse3 = np.sqrt(mean_squared_error(y_test3, y_pred3))
mae3 = mean_absolute_error(y_test3, y_pred3)
mape3 = mean_absolute_percentage_error(y_test3, y_pred3)
rmse_scores3.append(rmse3)
mae_scores3.append(mae3)
mape_scores3.append(mape3)
avg_rmse3 = np.mean(rmse_scores3)
avg_mae3 = np.mean(mae_scores3)
avg_mape3 = np.mean(mape_scores3)
# create a table for error metrics
error_table = pd.DataFrame({
'Fold': range(1, 6),
'RMSE': rmse_scores3,
'MAE': mae_scores3,
'MAPE': mape_scores3})
display(Markdown("#### Error Metric By Fold: Model_3"))
display(error_table)
# create a table for average error metrics
avg_error_table = pd.DataFrame({
'Error Metric': ['RMSE', 'MAE', 'MAPE'],
'Average Error': [avg_rmse3, avg_mae3, avg_mape3]})
display(Markdown("#### Average Error Metrics: Model_3"))
display(avg_error_table)
| Fold | RMSE | MAE | MAPE | |
|---|---|---|---|---|
| 0 | 1 | 9607.062367 | 7727.804152 | 3.224131 |
| 1 | 2 | 8780.053862 | 7651.410321 | 3.366191 |
| 2 | 3 | 10555.217055 | 8612.614501 | 3.677276 |
| 3 | 4 | 7259.749627 | 5529.940879 | 2.419758 |
| 4 | 5 | 11274.842725 | 8059.533868 | 3.467748 |
| Error Metric | Average Error | |
|---|---|---|
| 0 | RMSE | 9495.385127 |
| 1 | MAE | 7516.260744 |
| 2 | MAPE | 3.231021 |
from sklearn import linear_model
#define variables
X5 = X3_scaled
y5 = merged_1['Daily Use']
# set up training and testing data
X_train5, X_test5, y_train5, y_test5 = train_test_split(X5, y5, test_size=0.2, random_state=42)
lm = linear_model.LinearRegression()
model5 = lm.fit(X_train5, y_train5)
y_test_predictions = lm.predict(X_test5)
import pandas as pd
from IPython.display import display
df = pd.DataFrame({'Actual': y_test5[:10], 'Predicted': y_test_predictions[:10]})
display(Markdown("#### Predicted vs. Actual Values: Model_3"))
display(df.round(2))
| Actual | Predicted | |
|---|---|---|
| 70 | 213975.93 | 208677.44 |
| 195 | 264775.90 | 258697.55 |
| 183 | 260611.16 | 246132.42 |
| 9 | 290620.38 | 297152.93 |
| 128 | 238268.73 | 239913.45 |
| 110 | 247411.63 | 239325.27 |
| 234 | 250985.58 | 272143.90 |
| 94 | 214600.56 | 223278.95 |
| 152 | 248586.68 | 251981.40 |
| 16 | 203849.71 | 214367.33 |
data = {'Model_0': {'RMSE': 14222.189128, 'MAE': 11108.421988, 'MAPE': 4.739425},
'Model_2': {'RMSE': 9688.96, 'MAE': 7460.40, 'MAPE': 3.20},
'Model_3': {'RMSE': 9495.38, 'MAE': 7516.26, 'MAPE': 3.23}}
# create a dataframe from the dictionary
df = pd.DataFrame.from_dict(data, orient='index')
# display the dataframe using display()
display(Markdown("#### Model Comparison: Error Metric"))
display(df)
| RMSE | MAE | MAPE | |
|---|---|---|---|
| Model_0 | 14222.189128 | 11108.421988 | 4.739425 |
| Model_2 | 9688.960000 | 7460.400000 | 3.200000 |
| Model_3 | 9495.380000 | 7516.260000 | 3.230000 |
As we did not obtain statistical summaries for model_0 and model_2, we focus here on the error-metric comparison shown above.
In terms of error metrics, model_2 and model_3, which include the polynomial terms and interaction variables, clearly have much better predictive power than model_0.
It is worth noting that model_3 has slightly higher MAE and MAPE than model_2 but a lower RMSE; we therefore consider the two models to have comparable predictive power.
In conclusion, model_3 matches the predictive power of model_2 while using only 12 predictors.
We strongly recommend using model_3 as the forecasting model to predict daily maximum energy demand.
To identify the top 10 most important predictors (excluding the intercept), we can refer to the absolute t-statistic values in the regression results. The larger the absolute t-statistic, the more significant the predictor. The top 10 predictors, in descending order of |t|, are:
1 Weekday
2 Month:MaxTem
3 Minimum Temperature (°C)
4 Min Temp Cubed
5 Number Of Low
6 Month:MinTem
7 Max Temp Cubed
8 Month
9 3Pm Relative Humidity (%)
10 3pm Temp Cubed
Alternatively, we can look at the absolute values of the coefficients: since the features are standardized, a larger absolute coefficient indicates a stronger effect of that predictor on the dependent variable.
Based on the regression results, the top 10 most important predictors, in descending order of |coefficient|, are:
1 Minimum Temperature (°C)
2 Month:MaxTem
3 Weekday
4 Min Temp Cubed
5 Month
6 Month:MinTem
7 Max Temp Cubed
8 Number Of Low
9 3pm Temp Cubed
10 3Pm Relative Humidity (%)
It is worth highlighting that merely two predictors from the original dataset appear among the top 10 most significant variables. This underscores how much the newly created features enhance the model's performance.
Interestingly, energy consumption patterns vary between weekdays and weekends, with considerably lower usage on weekends.
One likely reason is that many businesses and workplaces are closed on weekends, reducing commercial energy consumption.
People also tend to spend more time at home on weekends, which suggests that staying at home may be the more energy-efficient option.
Our analysis revealed a V-shaped relationship between the temperature-related variables and 'Daily Use': energy demand is likely to rise during periods of extreme temperature, whether hot or cold.
Furthermore, the association between temperature and energy consumption differs by season: in summer, elevated temperatures typically coincide with a surge in demand, while in winter demand tends to rise as temperatures drop.
Our dataset comprises only 8 months of observations, short of a complete annual cycle. This limits our ability to capture the full seasonal trend and, in turn, to make highly accurate predictions.
Using data collected during the COVID-19 period to predict post-pandemic daily energy use may also be problematic, because the underlying conditions have shifted.
Circumstances during the pandemic, including energy policies, consumer usage patterns, and economic conditions, may diverge significantly from a post-pandemic environment.
These unique circumstances could limit the dataset's predictive power, leading to less accurate or less reliable forecasts.
Incorporating additional features into our model has the potential to enhance its predictive accuracy. Consider the following possibilities:
1 Numerical Energy Price Data: Instead of using price categories, incorporating precise energy price data may serve as a robust predictor, enabling the creation of a more accurate regression model.
2 'School Day' and 'Holiday' Variables: By introducing these dummy variables, we can better account for variations in energy consumption patterns. These variables complement the existing 'weekday' variable and offer a more comprehensive understanding of daily energy usage trends.
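A minimal sketch of how such dummy variables could be derived from a date column (the holiday calendar here is an assumed placeholder, as is the school-day definition):

```python
import pandas as pd

# Hypothetical sketch: deriving 'Holiday' and 'School Day' dummies from dates.
df = pd.DataFrame({"Date": pd.to_datetime(["2021-01-01", "2021-01-04", "2021-01-09"])})
public_holidays = {pd.Timestamp("2021-01-01")}  # assumed holiday calendar

df["Weekday"] = (df["Date"].dt.dayofweek < 5).astype(int)
df["Holiday"] = df["Date"].isin(public_holidays).astype(int)
# Assumed definition: a school day is a weekday that is not a public holiday.
df["School Day"] = ((df["Weekday"] == 1) & (df["Holiday"] == 0)).astype(int)
print(df[["Weekday", "Holiday", "School Day"]].values.tolist())
```

In practice the holiday set would come from an official Victorian public-holiday and school-term calendar rather than a hand-written list.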
To enhance the accuracy of energy use predictions, it is crucial to gather data from the post-COVID era. Additionally, incorporating data spanning 2-3 years prior to the pandemic can serve as a valuable supplementary resource.
To enhance the accuracy of our predictions, it is advisable to investigate more sophisticated regression models, including:
1 Regularization techniques, like Lasso and Elastic Net, which assist in mitigating multicollinearity issues.
2 Ensemble tree-based models, such as Random Forest and XGBoost, with optimized parameters that have the potential to surpass the performance of our current models.
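As a sketch of the first suggestion, Lasso with a cross-validated regularization strength shrinks redundant coefficients toward zero, performing feature selection and mitigating multicollinearity in one step. The data here is synthetic (only 2 of 10 features are informative); `merged_1` would be substituted in practice:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two features drive the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

# LassoCV picks the regularization strength by cross-validation.
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
model.fit(X, y)
coefs = model.named_steps["lassocv"].coef_

# Lasso shrinks the irrelevant coefficients toward zero.
print(np.round(coefs, 2))
```

The two informative coefficients survive while the eight noise coefficients are driven close to zero, which is exactly the behaviour that would help with the correlated temperature predictors in this dataset.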